First, we start loading the data
cols(
id = col_double(),
listing_url = col_character(),
scrape_id = col_double(),
last_scraped = col_date(format = ""),
name = col_character(),
description = col_character(),
neighborhood_overview = col_character(),
picture_url = col_character(),
host_id = col_double(),
host_url = col_character(),
host_name = col_character(),
host_since = col_date(format = ""),
host_location = col_character(),
host_about = col_character(),
host_response_time = col_character(),
host_response_rate = col_character(),
host_acceptance_rate = col_character(),
host_is_superhost = col_logical(),
host_thumbnail_url = col_character(),
host_picture_url = col_character(),
host_neighbourhood = col_character(),
host_listings_count = col_double(),
host_total_listings_count = col_double(),
host_verifications = col_character(),
host_has_profile_pic = col_logical(),
host_identity_verified = col_logical(),
neighbourhood = col_character(),
neighbourhood_cleansed = col_character(),
neighbourhood_group_cleansed = col_character(),
latitude = col_double(),
longitude = col_double(),
property_type = col_character(),
room_type = col_character(),
accommodates = col_double(),
bathrooms = col_logical(),
bathrooms_text = col_character(),
bedrooms = col_double(),
beds = col_double(),
amenities = col_character(),
price = col_character(),
minimum_nights = col_double(),
maximum_nights = col_double(),
minimum_minimum_nights = col_double(),
maximum_minimum_nights = col_double(),
minimum_maximum_nights = col_double(),
maximum_maximum_nights = col_double(),
minimum_nights_avg_ntm = col_double(),
maximum_nights_avg_ntm = col_double(),
calendar_updated = col_logical(),
has_availability = col_logical(),
availability_30 = col_double(),
availability_60 = col_double(),
availability_90 = col_double(),
availability_365 = col_double(),
calendar_last_scraped = col_date(format = ""),
number_of_reviews = col_double(),
number_of_reviews_ltm = col_double(),
number_of_reviews_l30d = col_double(),
first_review = col_date(format = ""),
last_review = col_date(format = ""),
review_scores_rating = col_double(),
review_scores_accuracy = col_double(),
review_scores_cleanliness = col_double(),
review_scores_checkin = col_double(),
review_scores_communication = col_double(),
review_scores_location = col_double(),
review_scores_value = col_double(),
license = col_character(),
instant_bookable = col_logical(),
calculated_host_listings_count = col_double(),
calculated_host_listings_count_entire_homes = col_double(),
calculated_host_listings_count_private_rooms = col_double(),
calculated_host_listings_count_shared_rooms = col_double(),
reviews_per_month = col_double()
)
Then, we select some variables to our data frame, recalculate in proper way, make it ready to use and count the room available for each property type.
subset_listings <- listings %>%
dplyr::select(host_since, host_response_rate, host_is_superhost, host_listings_count, latitude, longitude, room_type, accommodates, bedrooms, beds,price, minimum_nights, maximum_nights, number_of_reviews, number_of_reviews_ltm, review_scores_rating, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, review_scores_accuracy, instant_bookable,property_type)
subset_listings$price = as.double(substr(subset_listings$price,2,nchar(subset_listings$price)))
subset_listings$host_since = as.numeric(subset_listings$host_since)
subset_listings$host_response_rate = as.numeric(sub("%","",subset_listings$host_response_rate))/100
subset_listings$host_is_superhost = as.integer(subset_listings$host_is_superhost)
subset_listings$instant_bookable = as.integer(subset_listings$instant_bookable)
k<-subset_listings %>%
count(property_type) %>%
arrange(desc(n))
head(k)| property_type | n |
|---|---|
| Entire rental unit | 9404 |
| Private room in rental unit | 5311 |
| Private room in residential home | 816 |
| Entire loft | 603 |
| Entire condominium (condo) | 434 |
| Private room in condominium (condo) | 343 |
subset_listings <- subset_listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in rental unit","Private room in residential home","Entire loft") ~ property_type,
TRUE ~ "Other"
)) %>%
dplyr::select(-property_type)
subset_listings_categorical <- subset_listings %>%
dplyr::select(host_is_superhost, room_type, accommodates, instant_bookable,prop_type_simplified)
subset_listings_continous <- subset_listings %>%
dplyr::select(host_response_rate,host_since,host_listings_count, latitude, longitude, accommodates, bedrooms, beds,price, minimum_nights, maximum_nights, number_of_reviews, number_of_reviews_ltm, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, review_scores_accuracy)To create a full set of dummy variables,
dmy <- dummyVars(" ~ .", data = subset_listings)
subset_listings_finished <- data.frame(predict(dmy, newdata = subset_listings))
head(subset_listings_finished)| host_since | host_response_rate | host_is_superhost | host_listings_count | latitude | longitude | room_typeEntire.home.apt | room_typeHotel.room | room_typePrivate.room | room_typeShared.room | accommodates | bedrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews | number_of_reviews_ltm | review_scores_rating | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | review_scores_accuracy | instant_bookable | prop_type_simplifiedEntire.loft | prop_type_simplifiedEntire.rental.unit | prop_type_simplifiedOther | prop_type_simplifiedPrivate.room.in.rental.unit | prop_type_simplifiedPrivate.room.in.residential.home |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.44e+04 | 1 | 1 | 1 | 40.5 | -3.68 | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 60 | 1 | 1.12e+03 | 80 | 4 | 4.87 | 4.81 | 4.8 | 4.89 | 4.77 | 4.85 | 4.91 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1.47e+04 | 0 | 2 | 40.4 | -3.74 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 31 | 4 | 40 | 33 | 0 | 4.58 | 4.56 | 4.75 | 4.82 | 4.21 | 4.67 | 4.72 | 0 | 0 | 0 | 0 | 1 | 0 | |
| 1.47e+04 | 0.97 | 0 | 10 | 40.4 | -3.7 | 1 | 0 | 0 | 0 | 6 | 3 | 5 | 50 | 15 | 730 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | |||||||
| 1.84e+04 | 0 | 1 | 40.4 | -3.71 | 1 | 0 | 0 | 0 | 3 | 1 | 92 | 5 | 1.12e+03 | 12 | 2 | 4.92 | 4.83 | 5 | 5 | 5 | 4.83 | 4.75 | 0 | 0 | 1 | 0 | 0 | 0 | ||
| 1.47e+04 | 1 | 0 | 1 | 40.4 | -3.69 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 26 | 2 | 1.12e+03 | 149 | 0 | 4.68 | 4.92 | 4.78 | 4.71 | 4.7 | 4.72 | 4.8 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1.47e+04 | 0.97 | 0 | 10 | 40.4 | -3.69 | 1 | 0 | 0 | 0 | 4 | 3 | 3 | 85 | 1 | 365 | 6 | 3 | 4 | 4.2 | 4.8 | 4.4 | 4.2 | 4.4 | 4.2 | 0 | 0 | 1 | 0 | 0 | 0 |
Conduct a thorough EDA. Recall that an EDA involves three things:
dplyr::glimpse()dplyr::glimpse(subset_listings)Rows: 18,909
Columns: 24
$ host_since <dbl> 14350, 14661, 14657, 18397, 14726, 14657, ~
$ host_response_rate <dbl> 1.00, NA, 0.97, NA, 1.00, 0.97, 0.25, NA, ~
$ host_is_superhost <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, ~
$ host_listings_count <dbl> 1, 2, 10, 1, 1, 10, 3, 4, 1, 1, 12, 3, 1, ~
$ latitude <dbl> 40.45724, 40.40381, 40.38840, 40.42183, 40~
$ longitude <dbl> -3.67688, -3.74130, -3.69511, -3.70529, -3~
$ room_type <chr> "Private room", "Private room", "Entire ho~
$ accommodates <dbl> 2, 1, 6, 3, 1, 4, 2, 1, 2, 5, 4, 3, 3, 1, ~
$ bedrooms <dbl> 1, 1, 3, NA, 1, 3, 1, 1, 1, 2, 1, 1, 2, 1,~
$ beds <dbl> 1, 1, 5, 1, 1, 3, 2, 1, 1, 3, 2, 3, 2, 1, ~
$ price <dbl> 60, 31, 50, 92, 26, 85, 65, 15, 54, NA, 81~
$ minimum_nights <dbl> 1, 4, 15, 5, 2, 1, 5, 1, 3, 3, 3, 5, 10, 1~
$ maximum_nights <dbl> 1125, 40, 730, 1125, 1125, 365, 180, 1124,~
$ number_of_reviews <dbl> 80, 33, 0, 12, 149, 6, 170, 6, 8, 0, 118, ~
$ number_of_reviews_ltm <dbl> 4, 0, 0, 2, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, ~
$ review_scores_rating <dbl> 4.87, 4.58, NA, 4.92, 4.68, 4.00, 4.64, 5.~
$ review_scores_cleanliness <dbl> 4.81, 4.56, NA, 4.83, 4.92, 4.20, 4.89, 4.~
$ review_scores_checkin <dbl> 4.80, 4.75, NA, 5.00, 4.78, 4.80, 4.84, 4.~
$ review_scores_communication <dbl> 4.89, 4.82, NA, 5.00, 4.71, 4.40, 4.80, 4.~
$ review_scores_location <dbl> 4.77, 4.21, NA, 5.00, 4.70, 4.20, 4.90, 5.~
$ review_scores_value <dbl> 4.85, 4.67, NA, 4.83, 4.72, 4.40, 4.71, 4.~
$ review_scores_accuracy <dbl> 4.91, 4.72, NA, 4.75, 4.80, 4.20, 4.73, 4.~
$ instant_bookable <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ~
$ prop_type_simplified <chr> "Private room in rental unit", "Private ro~
after we had selected the raw data to our dara frame, there are 24 variables and 18,909 observations in total. There are 22 variables in numeric (e.g., price or maximum_nights) and variables in character (e.g., room_type or prop_type_simplified).
Computing summary statistics of the variables of interest, or finding NAs
mosaic::favstats()favstats(~review_scores_rating, data=subset_listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 4.5 | 4.75 | 4.94 | 5 | 4.55 | 0.783 | 13877 | 5032 |
favstats(~price, data=subset_listings)| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 36 | 63 | 102 | 999 | 94.9 | 114 | 18637 | 272 |
skimr::skim()skimr::skim(subset_listings)| Name | subset_listings |
| Number of rows | 18909 |
| Number of columns | 24 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 22 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| room_type | 0 | 1 | 10 | 15 | 0 | 4 | 0 |
| prop_type_simplified | 0 | 1 | 5 | 32 | 0 | 5 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_since | 26 | 1.00 | 17108.95 | 927.82 | 14290.00 | 16476.00 | 17086.00 | 17876.00 | 18879.00 | <U+2581><U+2583><U+2587><U+2587><U+2586> |
| host_response_rate | 6937 | 0.63 | 0.88 | 0.26 | 0.00 | 0.92 | 1.00 | 1.00 | 1.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| host_is_superhost | 26 | 1.00 | 0.18 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | <U+2587><U+2581><U+2581><U+2581><U+2582> |
| host_listings_count | 26 | 1.00 | 11.61 | 32.34 | 0.00 | 1.00 | 2.00 | 6.00 | 515.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 40.42 | 0.02 | 40.33 | 40.41 | 40.42 | 40.43 | 40.57 | <U+2581><U+2587><U+2583><U+2581><U+2581> |
| longitude | 0 | 1.00 | -3.69 | 0.03 | -3.89 | -3.71 | -3.70 | -3.69 | -3.55 | <U+2581><U+2581><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1.00 | 3.09 | 1.96 | 0.00 | 2.00 | 2.00 | 4.00 | 16.00 | <U+2587><U+2585><U+2581><U+2581><U+2581> |
| bedrooms | 1434 | 0.92 | 1.44 | 0.83 | 1.00 | 1.00 | 1.00 | 2.00 | 18.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 341 | 0.98 | 1.92 | 1.45 | 0.00 | 1.00 | 1.00 | 2.00 | 24.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 272 | 0.99 | 94.87 | 113.76 | 0.00 | 36.00 | 63.00 | 102.00 | 999.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 7.30 | 36.01 | 1.00 | 1.00 | 2.00 | 3.00 | 1125.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 1370.15 | 81133.97 | 1.00 | 90.00 | 1125.00 | 1125.00 | 11111111.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 32.71 | 65.25 | 0.00 | 0.00 | 5.00 | 32.00 | 744.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews_ltm | 0 | 1.00 | 3.97 | 9.94 | 0.00 | 0.00 | 0.00 | 3.00 | 256.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 5032 | 0.73 | 4.55 | 0.78 | 0.00 | 4.50 | 4.75 | 4.94 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_cleanliness | 5269 | 0.72 | 4.66 | 0.51 | 1.00 | 4.56 | 4.80 | 4.98 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_checkin | 5269 | 0.72 | 4.78 | 0.43 | 1.00 | 4.75 | 4.90 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_communication | 5269 | 0.72 | 4.78 | 0.46 | 1.00 | 4.75 | 4.92 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_location | 5272 | 0.72 | 4.79 | 0.38 | 1.00 | 4.73 | 4.92 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_value | 5273 | 0.72 | 4.59 | 0.50 | 1.00 | 4.50 | 4.71 | 4.88 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| review_scores_accuracy | 5271 | 0.72 | 4.72 | 0.47 | 1.00 | 4.67 | 4.86 | 5.00 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| instant_bookable | 0 | 1.00 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | <U+2587><U+2581><U+2581><U+2581><U+2587> |
ggplot2::ggplot()
geom_histogram() or geom_density() for numeric continuous variablescol_name = colnames(subset_listings_continous)
require(cowplot)
for(i in 1:19){
assign(paste("g", i, sep = ""), i)
}
k<-0
for (i in col_name){
k<-k+1
assign(paste("g", k, sep = ""), ggplot(subset_listings_continous, aes_string(x=i)) +
geom_density(alpha=0.2)+
theme_bw() + #theme
labs (
title = paste("Density Plot for ", i),
y = "Density"
))
}
plot_grid(g1,g2,g3,g4,g5,g6,nrow = 2)plot_grid(g7,g8,g9,g10,g11,g12,nrow = 2)plot_grid(g13,g14,g15,g16,g17,g18,g19,nrow = 3) * `geom_bar()` or `geom_col()` for categorical variables
col_name = colnames(subset_listings_categorical)
k<-0
for (i in col_name){
k<-k+1
assign(paste("g", k, sep = ""),
ggplot(subset_listings_categorical, aes_string(x=i)) +
geom_bar()+
theme_bw() + #theme
theme(text = element_text(size=10),
axis.text.x = element_text(angle=45, hjust=1)) +
labs (
title = paste("Barplot for ", i),
y = "Count"
))
}
plot_grid(g1,g2,g4,nrow = 1)plot_grid(g3,g5,nrow = 1)* `GGally::ggpairs()` for scaterrlot/correlation matrix
# Scatterplots across all variables
subset_listings_finished <-subset_listings_finished %>%
na.omit(subset_listings_finished)
par(cex = 0.7)
corrplot(cor(subset_listings_continous,use="pairwise.complete.obs"),type="upper", order="hclust", tl.col="black", tl.srt=45, cl.cex = 1/par("cex"))glimpse the data
glimpse(listings)Rows: 18,909
Columns: 74
$ id <dbl> 6369, 21853, 23001, 24805~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021091e+13, 2.021091e+1~
$ last_scraped <date> 2021-09-11, 2021-09-11, ~
$ name <chr> "Rooftop terrace room , ~
$ description <chr> "Excellent connection wit~
$ neighborhood_overview <chr> NA, "We live in a leafy n~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 13660, 83531, 82175, 3463~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Simon", "Abdel", "Jesus"~
$ host_since <date> 2009-04-16, 2010-02-21, ~
$ host_location <chr> "Madrid, Community of Mad~
$ host_about <chr> "Gay couple, heterofriend~
$ host_response_time <chr> "within a few hours", "N/~
$ host_response_rate <chr> "100%", "N/A", "97%", "N/~
$ host_acceptance_rate <chr> "83%", "N/A", "64%", "N/A~
$ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALSE~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Hispanoamérica", "Aluche~
$ host_listings_count <dbl> 1, 2, 10, 1, 1, 10, 3, 4,~
$ host_total_listings_count <dbl> 1, 2, 10, 1, 1, 10, 3, 4,~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, FALSE, ~
$ neighbourhood <chr> NA, "Madrid, Spain", "Mad~
$ neighbourhood_cleansed <chr> "Hispanoamérica", "Cármen~
$ neighbourhood_group_cleansed <chr> "Chamartín", "Latina", "A~
$ latitude <dbl> 40.45724, 40.40381, 40.38~
$ longitude <dbl> -3.67688, -3.74130, -3.69~
$ property_type <chr> "Private room in rental u~
$ room_type <chr> "Private room", "Private ~
$ accommodates <dbl> 2, 1, 6, 3, 1, 4, 2, 1, 2~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 shared bath", "1 bath"~
$ bedrooms <dbl> 1, 1, 3, NA, 1, 3, 1, 1, ~
$ beds <dbl> 1, 1, 5, 1, 1, 3, 2, 1, 1~
$ amenities <chr> "[\"Wifi\", \"Shampoo\", ~
$ price <chr> "$60.00", "$31.00", "$50.~
$ minimum_nights <dbl> 1, 4, 15, 5, 2, 1, 5, 1, ~
$ maximum_nights <dbl> 1125, 40, 730, 1125, 1125~
$ minimum_minimum_nights <dbl> 1, 4, 15, 5, 2, 1, 5, 1, ~
$ maximum_minimum_nights <dbl> 1, 4, 15, 5, 2, 1, 5, 1, ~
$ minimum_maximum_nights <dbl> 1125, 40, 730, 1125, 1125~
$ maximum_maximum_nights <dbl> 1125, 40, 730, 1125, 1125~
$ minimum_nights_avg_ntm <dbl> 1.0, 4.0, 15.0, 5.0, 2.0,~
$ maximum_nights_avg_ntm <dbl> 1125, 40, 730, 1125, 1125~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 30, 29, 0, 1, 14, 0, 0, 0~
$ availability_60 <dbl> 60, 59, 0, 1, 44, 0, 9, 1~
$ availability_90 <dbl> 90, 89, 0, 25, 74, 0, 39,~
$ availability_365 <dbl> 180, 364, 222, 115, 349, ~
$ calendar_last_scraped <date> 2021-09-11, 2021-09-11, ~
$ number_of_reviews <dbl> 80, 33, 0, 12, 149, 6, 17~
$ number_of_reviews_ltm <dbl> 4, 0, 0, 2, 0, 3, 0, 2, 0~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review <date> 2016-03-31, 2014-10-10, ~
$ last_review <date> 2019-05-14, 2018-05-29, ~
$ review_scores_rating <dbl> 4.87, 4.58, NA, 4.92, 4.6~
$ review_scores_accuracy <dbl> 4.91, 4.72, NA, 4.75, 4.8~
$ review_scores_cleanliness <dbl> 4.81, 4.56, NA, 4.83, 4.9~
$ review_scores_checkin <dbl> 4.80, 4.75, NA, 5.00, 4.7~
$ review_scores_communication <dbl> 4.89, 4.82, NA, 5.00, 4.7~
$ review_scores_location <dbl> 4.77, 4.21, NA, 5.00, 4.7~
$ review_scores_value <dbl> 4.85, 4.67, NA, 4.83, 4.7~
$ license <chr> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable <lgl> FALSE, FALSE, FALSE, FALS~
$ calculated_host_listings_count <dbl> 2, 2, 5, 1, 1, 5, 3, 4, 1~
$ calculated_host_listings_count_entire_homes <dbl> 0, 0, 4, 1, 0, 4, 3, 0, 1~
$ calculated_host_listings_count_private_rooms <dbl> 2, 2, 1, 0, 1, 1, 0, 4, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 1.21, 0.39, NA, 0.51, 1.6~
To drop any non-numeric characters
listings <- listings %>%
mutate(price = parse_number(price))
typeof(listings$price)[1] "double"
Ranking the most popular of property types. Top 4 are accounting for 85% of total.
#count the top 4 most common property types
prop<-listings %>%
count(property_type) %>%
arrange(desc(n))
head(prop)| property_type | n |
|---|---|
| Entire rental unit | 9404 |
| Private room in rental unit | 5311 |
| Private room in residential home | 816 |
| Entire loft | 603 |
| Entire condominium (condo) | 434 |
| Private room in condominium (condo) | 343 |
#the proportion of the total listings the 4 most common property types make up
sum(prop$n[1:4])/sum(prop$n)[1] 0.8532445
##the 4 most common property types make up 85.32%.
#create new variable: prop_type_simplified
listings <- listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in rental unit","Private room in residential home","Entire loft") ~ property_type,
TRUE ~ "Other"
))To check the variables in prop_type_simplified are correct,
k<-listings %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
head(k)| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 9404 |
| Private room in rental unit | Private room in rental unit | 5311 |
| Private room in residential home | Private room in residential home | 816 |
| Entire loft | Entire loft | 603 |
| Entire condominium (condo) | Other | 434 |
| Private room in condominium (condo) | Other | 343 |
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:
minimum_nights?k<-listings %>%
count(minimum_nights) %>%
arrange(desc(n))
head(k,)| minimum_nights | n |
|---|---|
| 1 | 7560 |
| 2 | 4630 |
| 3 | 2673 |
| 4 | 706 |
| 5 | 598 |
| 7 | 597 |
#1 and 2 are the most common values for the variable `minimum_nights`30 days stand out.
minimum_nights?This could be because someone wants to rent a house through Airbnb
Filter the airbnb data so that it only includes observations with minimum_nights <= 4
listings<-listings %>%
filter(minimum_nights <= 4)Visualisations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualisations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualisation we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.
The following code, having downloaded a dataframe listings with all AirbnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than equal to four (4). You could learn more about leaflet, by following the relevant Datacamp course on mapping with leaflet
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
Create a new variable called price_4_nights that uses price, and accomodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
##listings cleaning: only include the data that is intended for travel purposes
data0<-listings %>%
filter(accommodates>1,minimum_nights<=4,maximum_nights>=4,price>0,availability_30>=4)%>%
mutate(price_4_nights=case_when(
room_type=="Shared room"~price/accommodates*2*4,
TRUE ~ price*4
))
data<-data0
#data cleaning
data<-data[complete.cases(data[ , c(36,37,38,61,62)]),]
data <- data %>%
mutate(bathrooms = parse_number(bathrooms_text))
data$host_has_profile_pic<-as.numeric(data$host_has_profile_pic)
data$host_identity_verified<-as.numeric(data$host_identity_verified)
data$host_is_superhost<-as.numeric(data$host_is_superhost)
data$has_availability<-as.numeric(data$has_availability)
data$instant_bookable<-as.numeric(data$instant_bookable)
data<-data %>%
mutate(weeks=difftime( "2021-9-30",data$host_since ,units = "weeks"))
data<- data %>%
mutate(weeks = parse_number(as.character(weeks)))
data<-data %>%
filter(price_4_nights>55) %>%
filter(price_4_nights<4000)Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?
data %>%
ggplot(aes(x=price_4_nights))+
geom_histogram()+
theme_bw()+
#name the graph
labs (
title = "Distribution of Price",
y = "count",
x = "price_4_nights"
)+
NULLdata %>%
ggplot( aes(x=log(price_4_nights)))+
geom_histogram()+
theme_bw()+
#name the graph
labs (
title = "Distribution of log(Price)",
y = "count",
x = "log(price_4_nights)"
)+
NULL#almost normally distributedFit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
model1<-lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating,data)
summary(model1)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.68619 -0.36154 -0.05145 0.28801 2.56814
Coefficients:
Estimate Std. Error
(Intercept) 5.1371824 0.0914471
prop_type_simplifiedEntire rental unit 0.1905927 0.0479865
prop_type_simplifiedOther -0.0914510 0.0519197
prop_type_simplifiedPrivate room in rental unit -0.7529075 0.0511307
prop_type_simplifiedPrivate room in residential home -0.7226708 0.0681376
number_of_reviews -0.0012297 0.0001007
review_scores_rating 0.1603604 0.0168643
t value Pr(>|t|)
(Intercept) 56.177 < 2e-16 ***
prop_type_simplifiedEntire rental unit 3.972 7.24e-05 ***
prop_type_simplifiedOther -1.761 0.0782 .
prop_type_simplifiedPrivate room in rental unit -14.725 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -10.606 < 2e-16 ***
number_of_reviews -12.217 < 2e-16 ***
review_scores_rating 9.509 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5401 on 4509 degrees of freedom
Multiple R-squared: 0.3206, Adjusted R-squared: 0.3197
F-statistic: 354.7 on 6 and 4509 DF, p-value: < 2.2e-16
Interpret the coefficient prop_type_simplifiedEntire rental unit in terms of price_4_nights. Holding the other variables, if the property is Entire rental instead of Entire loft , price_4_nights will increase by 20.9967% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).
Interpret the coefficient prop_type_simplifiedOther in terms of price_4_nights. Holding the other variables, if the property is Other instead of Entire loft , price_4_nights will decrease by 9.5763% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).
Interpret the coefficient prop_type_simplifiedPrivate room in rental unit in terms of price_4_nights. Holding the other variables, if the property is Private room in rental instead of Entire loft, price_4_nights will decrease by 112.3164% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).
Interpret the coefficient prop_type_simplifiedPrivate room in residential home in terms of price_4_nights. Holding the other variables, if the property is Private room in residential home instead of Entire loft, price_4_nights will decrease by 105.9928% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).
Interpret the coefficient of number_of_reviews in terms of price_4_nights. Holding the other variables, each unit increase in number_of_reviews will increase price_4_nights by 117.3934% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).
Interpret the coefficient review_scores_rating in terms of price_4_nights. Holding the other variables, each unit increase in review_scores_rating will increase price_4_nights by 0.16925% (because a logarithmic transformation is performed on ‘price_4_nights’ variable).
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explananatory variables in model1 plus room_type.
model2<-lm(log(price_4_nights)~prop_type_simplified+room_type+number_of_reviews+review_scores_rating,data)
summary(model2)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + room_type +
number_of_reviews + review_scores_rating, data = data)
Residuals:
Min 1Q Median 3Q Max
-1.42481 -0.36193 -0.05797 0.27520 2.84489
Coefficients:
Estimate Std. Error
(Intercept) 5.171e+00 8.901e-02
prop_type_simplifiedEntire rental unit 1.905e-01 4.669e-02
prop_type_simplifiedOther 1.977e-01 5.511e-02
prop_type_simplifiedPrivate room in rental unit -1.287e-01 6.693e-02
prop_type_simplifiedPrivate room in residential home -9.894e-02 7.998e-02
room_typeHotel room -1.805e-02 7.772e-02
room_typePrivate room -6.241e-01 4.475e-02
room_typeShared room -1.205e+00 1.349e-01
number_of_reviews -1.218e-03 9.795e-05
review_scores_rating 1.530e-01 1.642e-02
t value Pr(>|t|)
(Intercept) 58.094 < 2e-16 ***
prop_type_simplifiedEntire rental unit 4.079 4.6e-05 ***
prop_type_simplifiedOther 3.587 0.000338 ***
prop_type_simplifiedPrivate room in rental unit -1.922 0.054665 .
prop_type_simplifiedPrivate room in residential home -1.237 0.216153
room_typeHotel room -0.232 0.816404
room_typePrivate room -13.945 < 2e-16 ***
room_typeShared room -8.934 < 2e-16 ***
number_of_reviews -12.431 < 2e-16 ***
review_scores_rating 9.318 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5254 on 4506 degrees of freedom
Multiple R-squared: 0.3574, Adjusted R-squared: 0.3561
F-statistic: 278.4 on 9 and 4506 DF, p-value: < 2.2e-16
anova(model1,model2)| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 4.51e+03 | 1.32e+03 | ||||
| 4.51e+03 | 1.24e+03 | 3 | 71.1 | 85.8 | 5.42e-54 |
vif(model2) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 6.774042 4 1.270153
room_type 6.672320 3 1.372080
number_of_reviews 1.033018 1 1.016375
review_scores_rating 1.011501 1 1.005734
##However, there is multicollinearityFrom the comparison, F = 85.812 with a p-value = 2.2e-16. Therefore, room_type is significant. However, there is multicollinearity
Our dataset has many more variables, so here are some ideas on how you can extend your analysis
bathrooms, bedrooms, beds, or size of the house (accomodates) significant predictors of price_4_nights? Or might these be co-linear variables?##correlation between `bathrooms`, `bedrooms`, `beds`, or size of the house (`accomodates`)
corr<- data %>%
dplyr::select(bathrooms,bedrooms,beds,accommodates,number_of_reviews,review_scores_rating)
corr<-corr[complete.cases(corr),]
library(corrplot)
corrplot(corr = cor(corr),order ="AOE", addCoef.col="grey")#`beds` ,`bedrooms`and `accomodates` are highly correlated
data3<-data %>%
dplyr::select(prop_type_simplified,bathrooms,bedrooms,beds,accommodates,number_of_reviews,review_scores_rating,price_4_nights)
model3<-lm(log(price_4_nights)~prop_type_simplified+bathrooms+bedrooms+beds+accommodates+number_of_reviews+review_scores_rating,data3)
summary(model3)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + bathrooms +
bedrooms + beds + accommodates + number_of_reviews + review_scores_rating,
data = data3)
Residuals:
Min 1Q Median 3Q Max
-3.16083 -0.28587 -0.03412 0.25084 2.68955
Coefficients:
Estimate Std. Error
(Intercept) 4.630e+00 8.122e-02
prop_type_simplifiedEntire rental unit 6.255e-02 4.201e-02
prop_type_simplifiedOther -9.805e-02 4.546e-02
prop_type_simplifiedPrivate room in rental unit -6.187e-01 4.509e-02
prop_type_simplifiedPrivate room in residential home -6.021e-01 5.998e-02
bathrooms 1.380e-01 1.361e-02
bedrooms 5.619e-02 1.246e-02
beds 6.901e-03 8.688e-03
accommodates 8.175e-02 7.264e-03
number_of_reviews -1.209e-03 8.789e-05
review_scores_rating 1.509e-01 1.472e-02
t value Pr(>|t|)
(Intercept) 57.003 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.489 0.137
prop_type_simplifiedOther -2.157 0.031 *
prop_type_simplifiedPrivate room in rental unit -13.720 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -10.039 < 2e-16 ***
bathrooms 10.138 < 2e-16 ***
bedrooms 4.509 6.66e-06 ***
beds 0.794 0.427
accommodates 11.255 < 2e-16 ***
number_of_reviews -13.756 < 2e-16 ***
review_scores_rating 10.256 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4706 on 4505 degrees of freedom
Multiple R-squared: 0.4847, Adjusted R-squared: 0.4835
F-statistic: 423.7 on 10 and 4505 DF, p-value: < 2.2e-16
#`bathrooms`, `bedrooms`,`accommodates` are significant
#select variables
step(model3)Start: AIC=-6797.36
log(price_4_nights) ~ prop_type_simplified + bathrooms + bedrooms +
beds + accommodates + number_of_reviews + review_scores_rating
Df Sum of Sq RSS AIC
- beds 1 0.140 997.72 -6798.7
<none> 997.59 -6797.4
- bedrooms 1 4.503 1002.09 -6779.0
- bathrooms 1 22.761 1020.35 -6697.5
- review_scores_rating 1 23.293 1020.88 -6695.1
- accommodates 1 28.050 1025.64 -6674.1
- number_of_reviews 1 41.901 1039.49 -6613.6
- prop_type_simplified 4 241.391 1238.98 -5826.7
Step: AIC=-6798.73
log(price_4_nights) ~ prop_type_simplified + bathrooms + bedrooms +
accommodates + number_of_reviews + review_scores_rating
Df Sum of Sq RSS AIC
<none> 997.72 -6798.7
- bedrooms 1 5.492 1003.22 -6775.9
- bathrooms 1 22.982 1020.71 -6697.9
- review_scores_rating 1 23.301 1021.03 -6696.5
- number_of_reviews 1 41.967 1039.69 -6614.7
- accommodates 1 51.635 1049.36 -6572.9
- prop_type_simplified 4 241.583 1239.31 -5827.5
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + bathrooms +
bedrooms + accommodates + number_of_reviews + review_scores_rating,
data = data3)
Coefficients:
(Intercept)
4.62539
prop_type_simplifiedEntire rental unit
0.06300
prop_type_simplifiedOther
-0.09653
prop_type_simplifiedPrivate room in rental unit
-0.61712
prop_type_simplifiedPrivate room in residential home
-0.60121
bathrooms
0.13854
bedrooms
0.05917
accommodates
0.08543
number_of_reviews
-0.00121
review_scores_rating
0.15097
model3<-lm(log(price_4_nights) ~ prop_type_simplified + bathrooms +
bedrooms + accommodates + number_of_reviews + review_scores_rating,
data = data3)
summary(model3)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + bathrooms +
bedrooms + accommodates + number_of_reviews + review_scores_rating,
data = data3)
Residuals:
Min 1Q Median 3Q Max
-3.10477 -0.28674 -0.03398 0.25114 2.68858
Coefficients:
Estimate Std. Error
(Intercept) 4.625e+00 8.101e-02
prop_type_simplifiedEntire rental unit 6.300e-02 4.200e-02
prop_type_simplifiedOther -9.653e-02 4.541e-02
prop_type_simplifiedPrivate room in rental unit -6.171e-01 4.505e-02
prop_type_simplifiedPrivate room in residential home -6.012e-01 5.996e-02
bathrooms 1.385e-01 1.360e-02
bedrooms 5.917e-02 1.188e-02
accommodates 8.543e-02 5.595e-03
number_of_reviews -1.210e-03 8.788e-05
review_scores_rating 1.510e-01 1.472e-02
t value Pr(>|t|)
(Intercept) 57.095 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.500 0.1337
prop_type_simplifiedOther -2.126 0.0336 *
prop_type_simplifiedPrivate room in rental unit -13.699 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -10.026 < 2e-16 ***
bathrooms 10.188 < 2e-16 ***
bedrooms 4.980 6.59e-07 ***
accommodates 15.271 < 2e-16 ***
number_of_reviews -13.767 < 2e-16 ***
review_scores_rating 10.258 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4706 on 4506 degrees of freedom
Multiple R-squared: 0.4846, Adjusted R-squared: 0.4836
F-statistic: 470.7 on 9 and 4506 DF, p-value: < 2.2e-16
vif(model3) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.311597 4 1.034487
bathrooms 1.592060 1 1.261769
bedrooms 2.556479 1 1.598899
accommodates 2.786001 1 1.669132
number_of_reviews 1.036908 1 1.018287
review_scores_rating 1.013512 1 1.006733
#No multicollinearity(host_is_superhost) command a pricing premium, after controlling for other variables?data4<-data %>%
dplyr::select(prop_type_simplified,host_is_superhost,bathrooms,bedrooms,accommodates,number_of_reviews,review_scores_rating,price_4_nights)
model4<-lm(log(price_4_nights)~.,data4)
summary(model4)
Call:
lm(formula = log(price_4_nights) ~ ., data = data4)
Residuals:
Min 1Q Median 3Q Max
-3.10195 -0.28855 -0.03563 0.25080 2.69026
Coefficients:
Estimate Std. Error
(Intercept) 4.6707780 0.0825791
prop_type_simplifiedEntire rental unit 0.0586328 0.0420022
prop_type_simplifiedOther -0.0981776 0.0453828
prop_type_simplifiedPrivate room in rental unit -0.6193546 0.0450233
prop_type_simplifiedPrivate room in residential home -0.5991017 0.0599223
host_is_superhost 0.0466049 0.0167494
bathrooms 0.1377639 0.0135915
bedrooms 0.0596835 0.0118733
accommodates 0.0852729 0.0055908
number_of_reviews -0.0012426 0.0000886
review_scores_rating 0.1397705 0.0152464
t value Pr(>|t|)
(Intercept) 56.561 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.396 0.16280
prop_type_simplifiedOther -2.163 0.03057 *
prop_type_simplifiedPrivate room in rental unit -13.756 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -9.998 < 2e-16 ***
host_is_superhost 2.782 0.00542 **
bathrooms 10.136 < 2e-16 ***
bedrooms 5.027 5.18e-07 ***
accommodates 15.252 < 2e-16 ***
number_of_reviews -14.025 < 2e-16 ***
review_scores_rating 9.167 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4702 on 4505 degrees of freedom
Multiple R-squared: 0.4855, Adjusted R-squared: 0.4843
F-statistic: 425.1 on 10 and 4505 DF, p-value: < 2.2e-16
vif(model4) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.318734 4 1.035189
host_is_superhost 1.110902 1 1.053993
bathrooms 1.592735 1 1.262036
bedrooms 2.557100 1 1.599093
accommodates 2.786301 1 1.669222
number_of_reviews 1.055401 1 1.027327
review_scores_rating 1.089416 1 1.043751
superhosts does command a pricing premium, after controlling for other variables
instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?data5<-data %>%
dplyr::select(prop_type_simplified,instant_bookable,host_is_superhost,bathrooms,bedrooms,accommodates,number_of_reviews,review_scores_rating,price_4_nights)
model5<-lm(log(price_4_nights)~.,data5)
summary(model5)
Call:
lm(formula = log(price_4_nights) ~ ., data = data5)
Residuals:
Min 1Q Median 3Q Max
-3.09429 -0.28770 -0.03699 0.24981 2.68232
Coefficients:
Estimate Std. Error
(Intercept) 4.645e+00 8.382e-02
prop_type_simplifiedEntire rental unit 5.605e-02 4.202e-02
prop_type_simplifiedOther -1.013e-01 4.541e-02
prop_type_simplifiedPrivate room in rental unit -6.160e-01 4.505e-02
prop_type_simplifiedPrivate room in residential home -5.969e-01 5.992e-02
instant_bookable 2.586e-02 1.464e-02
host_is_superhost 4.660e-02 1.675e-02
bathrooms 1.383e-01 1.359e-02
bedrooms 6.079e-02 1.189e-02
accommodates 8.473e-02 5.598e-03
number_of_reviews -1.249e-03 8.866e-05
review_scores_rating 1.423e-01 1.531e-02
t value Pr(>|t|)
(Intercept) 55.419 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.334 0.18225
prop_type_simplifiedOther -2.230 0.02577 *
prop_type_simplifiedPrivate room in rental unit -13.674 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -9.962 < 2e-16 ***
instant_bookable 1.766 0.07742 .
host_is_superhost 2.783 0.00541 **
bathrooms 10.174 < 2e-16 ***
bedrooms 5.114 3.28e-07 ***
accommodates 15.135 < 2e-16 ***
number_of_reviews -14.090 < 2e-16 ***
review_scores_rating 9.295 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4701 on 4504 degrees of freedom
Multiple R-squared: 0.4858, Adjusted R-squared: 0.4846
F-statistic: 386.9 on 11 and 4504 DF, p-value: < 2.2e-16
vif(model5) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.357186 4 1.038915
instant_bookable 1.056433 1 1.027829
host_is_superhost 1.110902 1 1.053993
bathrooms 1.593469 1 1.262327
bedrooms 2.564278 1 1.601336
accommodates 2.794873 1 1.671787
number_of_reviews 1.057321 1 1.028261
review_scores_rating 1.099239 1 1.048446
instant_bookable does command a pricing premium, after controlling for other variables.
neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighbourhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variabale neighbourhood_simplified and determine whether location is a predictor of price_4_nightsdata %>%
count(neighbourhood_group_cleansed) %>%
arrange(desc(n)) | neighbourhood_group_cleansed | n |
|---|---|
| Centro | 2488 |
| Salamanca | 271 |
| Chamberí | 231 |
| Arganzuela | 213 |
| Tetuán | 165 |
| Puente de Vallecas | 139 |
| Hortaleza | 121 |
| Retiro | 117 |
| Carabanchel | 107 |
| Chamartín | 99 |
| Ciudad Lineal | 94 |
| Latina | 86 |
| Moncloa - Aravaca | 78 |
| San Blas - Canillejas | 78 |
| Usera | 69 |
| Fuencarral - El Pardo | 49 |
| Barajas | 37 |
| Villa de Vallecas | 26 |
| Villaverde | 22 |
| Moratalaz | 17 |
| Vicálvaro | 9 |
#there are 21 districts in Madrid,divide them into five districts according to the number of houses and the distance from the Centro
data<-data %>%
mutate(neighbourhood_simplified=
case_when(
neighbourhood_group_cleansed=="Centro"~"loop1",
neighbourhood_group_cleansed %in% c("Salamanca","Chamberí","Arganzuela") ~ "loop2",
neighbourhood_group_cleansed %in% c("Tetuán","Retiro","Puente de Vallecas","Chamartín") ~ "loop3",
neighbourhood_group_cleansed %in% c("Carabanchel","Ciudad Lineal","Moncloa - Aravaca","Latina","San Blas - Canillejas","Hortaleza","Usera") ~ "loop4",
TRUE ~ "loop5"
))
data6<-data %>%
dplyr::select(prop_type_simplified,neighbourhood_simplified,instant_bookable,host_is_superhost,bathrooms,bedrooms,accommodates,number_of_reviews,review_scores_rating,price_4_nights)
model6<-lm(log(price_4_nights)~.,data6)
summary(model6)
Call:
lm(formula = log(price_4_nights) ~ ., data = data6)
Residuals:
Min 1Q Median 3Q Max
-3.00336 -0.27735 -0.03766 0.23279 2.85879
Coefficients:
Estimate Std. Error
(Intercept) 4.727e+00 8.278e-02
prop_type_simplifiedEntire rental unit 7.568e-03 4.138e-02
prop_type_simplifiedOther -1.097e-01 4.457e-02
prop_type_simplifiedPrivate room in rental unit -6.163e-01 4.424e-02
prop_type_simplifiedPrivate room in residential home -5.678e-01 5.889e-02
neighbourhood_simplifiedloop2 1.684e-02 1.981e-02
neighbourhood_simplifiedloop3 -1.140e-01 2.273e-02
neighbourhood_simplifiedloop4 -2.511e-01 2.139e-02
neighbourhood_simplifiedloop5 -2.255e-01 3.836e-02
instant_bookable 9.458e-03 1.444e-02
host_is_superhost 4.800e-02 1.644e-02
bathrooms 1.296e-01 1.336e-02
bedrooms 6.938e-02 1.169e-02
accommodates 8.282e-02 5.503e-03
number_of_reviews -1.366e-03 8.819e-05
review_scores_rating 1.477e-01 1.503e-02
t value Pr(>|t|)
(Intercept) 57.102 < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.183 0.85491
prop_type_simplifiedOther -2.462 0.01386 *
prop_type_simplifiedPrivate room in rental unit -13.929 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -9.642 < 2e-16 ***
neighbourhood_simplifiedloop2 0.850 0.39523
neighbourhood_simplifiedloop3 -5.017 5.45e-07 ***
neighbourhood_simplifiedloop4 -11.736 < 2e-16 ***
neighbourhood_simplifiedloop5 -5.880 4.41e-09 ***
instant_bookable 0.655 0.51237
host_is_superhost 2.920 0.00352 **
bathrooms 9.702 < 2e-16 ***
bedrooms 5.934 3.18e-09 ***
accommodates 15.051 < 2e-16 ***
number_of_reviews -15.491 < 2e-16 ***
review_scores_rating 9.828 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4612 on 4500 degrees of freedom
Multiple R-squared: 0.5055, Adjusted R-squared: 0.5039
F-statistic: 306.7 on 15 and 4500 DF, p-value: < 2.2e-16
vif(model6) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.441599 4 1.046780
neighbourhood_simplified 1.152689 4 1.017921
instant_bookable 1.067086 1 1.032999
host_is_superhost 1.112446 1 1.054726
bathrooms 1.599750 1 1.264812
bedrooms 2.576954 1 1.605289
accommodates 2.806030 1 1.675121
number_of_reviews 1.086918 1 1.042553
review_scores_rating 1.100501 1 1.049048
anova(model5,model6)| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 4.5e+03 | 995 | ||||
| 4.5e+03 | 957 | 4 | 38.2 | 44.8 | 5.56e-37 |
##F=44.845 with p< 2.2e-16, the location is significantavalability_30 or reviews_per_month on price_4_nights, after we control for other variables?data7<-data %>%
dplyr::select(prop_type_simplified,neighbourhood_simplified,instant_bookable,availability_30,host_is_superhost,bathrooms,bedrooms,accommodates,review_scores_rating,reviews_per_month,price_4_nights)
model7<-lm(log(price_4_nights)~.,data7)
summary(model7)
Call:
lm(formula = log(price_4_nights) ~ ., data = data7)
Residuals:
Min 1Q Median 3Q Max
-3.08748 -0.25785 -0.03334 0.22036 2.63836
Coefficients:
Estimate Std. Error
(Intercept) 4.4281155 0.0798698
prop_type_simplifiedEntire rental unit -0.0112065 0.0386681
prop_type_simplifiedOther -0.1942975 0.0417525
prop_type_simplifiedPrivate room in rental unit -0.7457868 0.0416723
prop_type_simplifiedPrivate room in residential home -0.7275543 0.0554074
neighbourhood_simplifiedloop2 0.0202601 0.0185070
neighbourhood_simplifiedloop3 -0.1387502 0.0212437
neighbourhood_simplifiedloop4 -0.2929612 0.0200530
neighbourhood_simplifiedloop5 -0.2613945 0.0358827
instant_bookable 0.0628427 0.0136325
availability_30 0.0177101 0.0008806
host_is_superhost 0.0975186 0.0154819
bathrooms 0.1356782 0.0124748
bedrooms 0.0720092 0.0109294
accommodates 0.0788687 0.0051457
review_scores_rating 0.1676647 0.0140711
reviews_per_month -0.0762672 0.0043085
t value Pr(>|t|)
(Intercept) 55.442 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.290 0.772
prop_type_simplifiedOther -4.654 3.36e-06 ***
prop_type_simplifiedPrivate room in rental unit -17.896 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -13.131 < 2e-16 ***
neighbourhood_simplifiedloop2 1.095 0.274
neighbourhood_simplifiedloop3 -6.531 7.24e-11 ***
neighbourhood_simplifiedloop4 -14.609 < 2e-16 ***
neighbourhood_simplifiedloop5 -7.285 3.79e-13 ***
instant_bookable 4.610 4.14e-06 ***
availability_30 20.112 < 2e-16 ***
host_is_superhost 6.299 3.28e-10 ***
bathrooms 10.876 < 2e-16 ***
bedrooms 6.589 4.95e-11 ***
accommodates 15.327 < 2e-16 ***
review_scores_rating 11.916 < 2e-16 ***
reviews_per_month -17.701 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4311 on 4499 degrees of freedom
Multiple R-squared: 0.568, Adjusted R-squared: 0.5665
F-statistic: 369.8 on 16 and 4499 DF, p-value: < 2.2e-16
vif(model7) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.536214 4 1.055131
neighbourhood_simplified 1.159326 4 1.018652
instant_bookable 1.089101 1 1.043600
availability_30 1.255944 1 1.120689
host_is_superhost 1.129026 1 1.062556
bathrooms 1.596089 1 1.263364
bedrooms 2.577386 1 1.605424
accommodates 2.807799 1 1.675649
review_scores_rating 1.103813 1 1.050625
reviews_per_month 1.158354 1 1.076269
autoplot(model_x)library(ggfortify)
autoplot(model7)car::vif(model7) GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.536214 4 1.055131
neighbourhood_simplified 1.159326 4 1.018652
instant_bookable 1.089101 1 1.043600
availability_30 1.255944 1 1.120689
host_is_superhost 1.129026 1 1.062556
bathrooms 1.596089 1 1.263364
bedrooms 2.577386 1 1.605424
accommodates 2.807799 1 1.675649
review_scores_rating 1.103813 1 1.050625
reviews_per_month 1.158354 1 1.076269
huxtable that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.library(huxtable)
huxreg(list("model1" = model1, "model2" = model2, "model3" = model3, "model4" = model4,"model5"=model5,"model6"=model6,"model7"=model7))| model1 | model2 | model3 | model4 | model5 | model6 | model7 | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 5.137 *** | 5.171 *** | 4.625 *** | 4.671 *** | 4.645 *** | 4.727 *** | 4.428 *** |
| (0.091) | (0.089) | (0.081) | (0.083) | (0.084) | (0.083) | (0.080) | |
| prop_type_simplifiedEntire rental unit | 0.191 *** | 0.190 *** | 0.063 | 0.059 | 0.056 | 0.008 | -0.011 |
| (0.048) | (0.047) | (0.042) | (0.042) | (0.042) | (0.041) | (0.039) | |
| prop_type_simplifiedOther | -0.091 | 0.198 *** | -0.097 * | -0.098 * | -0.101 * | -0.110 * | -0.194 *** |
| (0.052) | (0.055) | (0.045) | (0.045) | (0.045) | (0.045) | (0.042) | |
| prop_type_simplifiedPrivate room in rental unit | -0.753 *** | -0.129 | -0.617 *** | -0.619 *** | -0.616 *** | -0.616 *** | -0.746 *** |
| (0.051) | (0.067) | (0.045) | (0.045) | (0.045) | (0.044) | (0.042) | |
| prop_type_simplifiedPrivate room in residential home | -0.723 *** | -0.099 | -0.601 *** | -0.599 *** | -0.597 *** | -0.568 *** | -0.728 *** |
| (0.068) | (0.080) | (0.060) | (0.060) | (0.060) | (0.059) | (0.055) | |
| number_of_reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | ||
| review_scores_rating | 0.160 *** | 0.153 *** | 0.151 *** | 0.140 *** | 0.142 *** | 0.148 *** | 0.168 *** |
| (0.017) | (0.016) | (0.015) | (0.015) | (0.015) | (0.015) | (0.014) | |
| room_typeHotel room | -0.018 | ||||||
| (0.078) | |||||||
| room_typePrivate room | -0.624 *** | ||||||
| (0.045) | |||||||
| room_typeShared room | -1.205 *** | ||||||
| (0.135) | |||||||
| bathrooms | 0.139 *** | 0.138 *** | 0.138 *** | 0.130 *** | 0.136 *** | ||
| (0.014) | (0.014) | (0.014) | (0.013) | (0.012) | |||
| bedrooms | 0.059 *** | 0.060 *** | 0.061 *** | 0.069 *** | 0.072 *** | ||
| (0.012) | (0.012) | (0.012) | (0.012) | (0.011) | |||
| accommodates | 0.085 *** | 0.085 *** | 0.085 *** | 0.083 *** | 0.079 *** | ||
| (0.006) | (0.006) | (0.006) | (0.006) | (0.005) | |||
| host_is_superhost | 0.047 ** | 0.047 ** | 0.048 ** | 0.098 *** | |||
| (0.017) | (0.017) | (0.016) | (0.015) | ||||
| instant_bookable | 0.026 | 0.009 | 0.063 *** | ||||
| (0.015) | (0.014) | (0.014) | |||||
| neighbourhood_simplifiedloop2 | 0.017 | 0.020 | |||||
| (0.020) | (0.019) | ||||||
| neighbourhood_simplifiedloop3 | -0.114 *** | -0.139 *** | |||||
| (0.023) | (0.021) | ||||||
| neighbourhood_simplifiedloop4 | -0.251 *** | -0.293 *** | |||||
| (0.021) | (0.020) | ||||||
| neighbourhood_simplifiedloop5 | -0.226 *** | -0.261 *** | |||||
| (0.038) | (0.036) | ||||||
| availability_30 | 0.018 *** | ||||||
| (0.001) | |||||||
| reviews_per_month | -0.076 *** | ||||||
| (0.004) | |||||||
| N | 4516 | 4516 | 4516 | 4516 | 4516 | 4516 | 4516 |
| R2 | 0.321 | 0.357 | 0.485 | 0.485 | 0.486 | 0.506 | 0.568 |
| logLik | -3622.171 | -3496.718 | -2998.562 | -2994.685 | -2993.121 | -2904.859 | -2599.746 |
| AIC | 7260.343 | 7015.436 | 6019.124 | 6013.370 | 6012.243 | 5843.719 | 5235.491 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||||
price_4_nights.#####Model test
library(rsample)
set.seed(1234)
train_test_split<-initial_split(data7,prop=0.75)
data_train<-training(train_test_split)
data_test<-testing(train_test_split)
rmse_train<-data_train %>%
mutate(predictions=predict(model7,.)) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_train[1] 23052.75
rmse_test<-data_test %>%
mutate(predictions = predict(model7,.)) %>%
summarise(sqrt(sum(predictions - price_4_nights)**2/n())) %>%
pull()
rmse_test[1] 13863.82
summary(model7)
Call:
lm(formula = log(price_4_nights) ~ ., data = data7)
Residuals:
Min 1Q Median 3Q Max
-3.08748 -0.25785 -0.03334 0.22036 2.63836
Coefficients:
Estimate Std. Error
(Intercept) 4.4281155 0.0798698
prop_type_simplifiedEntire rental unit -0.0112065 0.0386681
prop_type_simplifiedOther -0.1942975 0.0417525
prop_type_simplifiedPrivate room in rental unit -0.7457868 0.0416723
prop_type_simplifiedPrivate room in residential home -0.7275543 0.0554074
neighbourhood_simplifiedloop2 0.0202601 0.0185070
neighbourhood_simplifiedloop3 -0.1387502 0.0212437
neighbourhood_simplifiedloop4 -0.2929612 0.0200530
neighbourhood_simplifiedloop5 -0.2613945 0.0358827
instant_bookable 0.0628427 0.0136325
availability_30 0.0177101 0.0008806
host_is_superhost 0.0975186 0.0154819
bathrooms 0.1356782 0.0124748
bedrooms 0.0720092 0.0109294
accommodates 0.0788687 0.0051457
review_scores_rating 0.1676647 0.0140711
reviews_per_month -0.0762672 0.0043085
t value Pr(>|t|)
(Intercept) 55.442 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.290 0.772
prop_type_simplifiedOther -4.654 3.36e-06 ***
prop_type_simplifiedPrivate room in rental unit -17.896 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -13.131 < 2e-16 ***
neighbourhood_simplifiedloop2 1.095 0.274
neighbourhood_simplifiedloop3 -6.531 7.24e-11 ***
neighbourhood_simplifiedloop4 -14.609 < 2e-16 ***
neighbourhood_simplifiedloop5 -7.285 3.79e-13 ***
instant_bookable 4.610 4.14e-06 ***
availability_30 20.112 < 2e-16 ***
host_is_superhost 6.299 3.28e-10 ***
bathrooms 10.876 < 2e-16 ***
bedrooms 6.589 4.95e-11 ***
accommodates 15.327 < 2e-16 ***
review_scores_rating 11.916 < 2e-16 ***
reviews_per_month -17.701 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4311 on 4499 degrees of freedom
Multiple R-squared: 0.568, Adjusted R-squared: 0.5665
F-statistic: 369.8 on 16 and 4499 DF, p-value: < 2.2e-16
new_data<-data.frame("Private room in rental unit","loop2",1,30,1,1,2,2,4.6,3)
names(new_data)<-c("prop_type_simplified","neighbourhood_simplified",'instant_bookable','availability_30','host_is_superhost','bathrooms','bedrooms','accommodates','review_scores_rating','reviews_per_month')
price_4_nights_pred<-exp(predict(model7,new_data,interval = "confidence"))
price_4_nights_pred fit lwr upr
1 215.7548 202.3355 230.064
point prediction is 215.7548 interval is:[202.34, 230.06]
Our best model is model7 as 56.8% of the data fit this regression model which represent the highest value of r-squared. There are 14 variables (X) in total that are able to explain in the price for 4 night for 2 people (Y). The most 2 important factors that cost the higher price of hotel in Madrid, Spain are the review score rating and bathroom (number of available and cleaness). However, property types and locations have a negative impact on price for 4 night for 2 people. For example, the higher distance from central will cost the hotel for 4 nights for 2 propel to be lower such as Latine, San Blas - Canillejas or Hortaleza.
After we had imported the raw data, we select only the important variables that will have an impact on our analysis. Then, we declare the variable types; double, character or logic, for example.
To make the property types more clarify, we classified it into 4 main groups which are 1) Entire rental unit 2) Private room in rental unit 3) Private room in residential home and 4) Entire loft and the remaining are Other.
To see the big picture of our data frame including some statistics, we use glimpse(), favstats() and skim()
First, we imported the raw data and clean it to more organized. Then, we ensure data to meet the assumption; normal distribution by using log rather than linear line. After that, we perform the regression analysis, as we add more variables in the regression model, the r-squared keep improving and those variables are statistically significant to explain Y. Once we got the regression models, then we ran the diagnosis to check for homoscedasticity and made the conclusion in comparison table. In order to select the best model, R-squared represents a goodness-of-fit measure for linear regression models, meaning that higher R-squared is better. As a result, we would suggest you to use the model7 that has the highest value of R-squared at 0.568, so that 56.8% of the data fit the regression model.
Focusing on significant variables,ceteris paribus, an increase 1) review_scores_rating 2) bathrooms 3) bedrooms 4) accommodates 5) host_is_superhost 6) instant_bookable and 7) availability_30 will increase the price for 4 nights.
Focusing on significant variables, ceteris paribus, an increase in 1) prop_type_simplifiedOther 2) prop_type_simplifiedPrivate room in rental unit 3)prop_type_simplifiedPrivate room in residential home 4)neighbourhood_simplifiedloop3 5) neighbourhood_simplifiedloop4 6) neighbourhood_simplifiedloop5 and 7) reviews_per_month will decrease the price for 4 nights.
Even though the models are acceptable but the residuals are not normally distributed, so it obey the basic assumptions of OLS regression. Therefore, to improve the analysis, the next step is to modify the residuals to make it normally distributed. To do so, we can use BOX-COC method.
Your work will be assessed on a rubric which you can find here